{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP Course 2 Week 1 Lesson : Building The Model - Lecture Exercise 02\n",
    "Estimated Time: 20 minutes\n",
    "<br>\n",
    "# Candidates from String Edits\n",
    "Create a list of candidate strings by applying an edit operation\n",
    "<br>\n",
    "### Imports and Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-21T14:58:26.992103Z",
     "start_time": "2021-04-21T14:58:26.988083Z"
    }
   },
   "outputs": [],
   "source": [
    "# data\n",
    "word = 'dearz' # 🦌"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Splits\n",
    "Find all the ways you can split a word into 2 parts !"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-21T14:58:28.573007Z",
     "start_time": "2021-04-21T14:58:28.560912Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['', 'dearz']\n",
      "['d', 'earz']\n",
      "['de', 'arz']\n",
      "['dea', 'rz']\n",
      "['dear', 'z']\n",
      "['dearz', '']\n"
     ]
    }
   ],
   "source": [
    "# splits with a loop\n",
    "splits_a = []\n",
    "for i in range(len(word)+1):\n",
    "    splits_a.append([word[:i],word[i:]])\n",
    "\n",
    "for i in splits_a:\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-21T14:59:07.152367Z",
     "start_time": "2021-04-21T14:59:07.146331Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('', 'dearz')\n",
      "('d', 'earz')\n",
      "('de', 'arz')\n",
      "('dea', 'rz')\n",
      "('dear', 'z')\n",
      "('dearz', '')\n"
     ]
    }
   ],
   "source": [
    "# same splits, done using a list comprehension\n",
    "splits_b = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n",
    "\n",
    "for i in splits_b:\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete Edit\n",
    "Delete a letter from each string in the `splits` list.\n",
    "<br>\n",
    "What this does is effectivly delete each possible letter from the original word being edited. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-21T14:59:50.808691Z",
     "start_time": "2021-04-21T14:59:50.793977Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "word :  dearz\n",
      "earz  <-- delete  d\n",
      "darz  <-- delete  e\n",
      "derz  <-- delete  a\n",
      "deaz  <-- delete  r\n",
      "dear  <-- delete  z\n"
     ]
    }
   ],
   "source": [
    "# deletes with a loop\n",
    "splits = splits_a\n",
    "deletes = []\n",
    "\n",
    "print('word : ', word)\n",
    "for L,R in splits:\n",
    "    if R:\n",
    "        print(L + R[1:], ' <-- delete ', R[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's worth taking a closer look at how this is excecuting a 'delete'.\n",
    "<br>\n",
    "Taking the first item from the `splits` list :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-21T15:02:00.858472Z",
     "start_time": "2021-04-21T15:02:00.848834Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "word :  dearz\n",
      "first item from the splits list :  ['', 'dearz']\n",
      "L :  \n",
      "R :  dearz\n",
      "*** now implicit delete by excluding the leading letter ***\n",
      "L + R[1:] :  earz  <-- delete  d\n"
     ]
    }
   ],
   "source": [
    "# breaking it down\n",
    "print('word : ', word)\n",
    "one_split = splits[0]\n",
    "print('first item from the splits list : ', one_split)\n",
    "L = one_split[0]\n",
    "R = one_split[1]\n",
    "print('L : ', L)\n",
    "print('R : ', R)\n",
    "print('*** now implicit delete by excluding the leading letter ***')\n",
    "print('L + R[1:] : ',L + R[1:], ' <-- delete ', R[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the end result transforms **'dearz'** to **'earz'** by deleting the first character.\n",
    "<br>\n",
    "And you use a **loop** (code block above) or a **list comprehension** (code block below) to do\n",
    "<br>\n",
    "this for the entire `splits` list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-21T15:03:21.889135Z",
     "start_time": "2021-04-21T15:03:21.879648Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['earz', 'darz', 'derz', 'deaz', 'dear']\n",
      "*** which is the same as ***\n",
      "earz\n",
      "darz\n",
      "derz\n",
      "deaz\n",
      "dear\n"
     ]
    }
   ],
   "source": [
    "# deletes with a list comprehension\n",
    "splits = splits_a\n",
    "deletes = [L + R[1:] for L, R in splits if R]\n",
    "\n",
    "print(deletes)\n",
    "print('*** which is the same as ***')\n",
    "for i in deletes:\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ungraded Exercise\n",
    "You now have a list of ***candidate strings*** created after performing a **delete** edit.\n",
    "<br>\n",
    "Next step will be to filter this list for ***candidate words*** found in a vocabulary.\n",
    "<br>\n",
    "Given the example vocab below, can you think of a way to create a list of candidate words ? \n",
    "<br>\n",
    "Remember, you already have a list of candidate strings, some of which are certainly not actual words you might find in your vocabulary !\n",
    "<br>\n",
    "<br>\n",
    "So from the above list **earz, darz, derz, deaz, dear**. \n",
    "<br>\n",
    "You're really only interested in **dear**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-21T15:03:28.267652Z",
     "start_time": "2021-04-21T15:03:28.257701Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vocab :  ['dean', 'deer', 'dear', 'fries', 'and', 'coke']\n",
      "edits :  ['earz', 'darz', 'derz', 'deaz', 'dear']\n",
      "candidate words :  []\n"
     ]
    }
   ],
   "source": [
    "vocab = ['dean','deer','dear','fries','and','coke']\n",
    "edits = list(deletes)\n",
    "\n",
    "print('vocab : ', vocab)\n",
    "print('edits : ', edits)\n",
    "\n",
    "candidates=[]\n",
    "\n",
    "### START CODE HERE ###\n",
    "#candidates = ??  # hint: 'set.intersection'\n",
    "### END CODE HERE ###\n",
    "\n",
    "print('candidate words : ', candidates)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Expected Outcome:\n",
    "\n",
    "vocab :  ['dean', 'deer', 'dear', 'fries', 'and', 'coke']\n",
    "<br>\n",
    "edits :  ['earz', 'darz', 'derz', 'deaz', 'dear']\n",
    "<br>\n",
    "candidate words :  {'dear'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Summary\n",
    "You've unpacked an integral part of the assignment by breaking down **splits** and **edits**, specifically looking at **deletes** here.\n",
    "<br>\n",
    "Implementation of the other edit types (insert, replace, switch) follow a similar methodology and should now feel somewhat familiar when you see them.\n",
    "<br>\n",
    "This bit of the code isn't as intuitive as other sections, so well done!\n",
    "<br>\n",
    "You should now feel confident facing some of the more technical parts of the assignment at the end of the week."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
